Zen Match is our casual game, introduced in 2021. Each day nearly 2 million users play the game, generating terabytes of data. For today we prepared a small dataset from this data, together with a challenge to solve.
As we discussed on our blog, difficulty affects the gaming experience. In short, skilled players prefer the game to be harder, while newcomers to the genre would rather have an easier experience; see https://science.goodjobgames.com for more.
Figure 1: The skills of the segments (assumed constant across levels, which is apparently not the case in real historical data) and the skill-difficulty gap of the different segments. It shows that newbies enter the frustration zone much faster than average players, while skilled players always stay in the boredom zone.
The aim of this workshop is to understand the players' abilities (skills) using real historical data. We will try to estimate how many retries a player will need to pass a level, before they actually reach that level.
At the end of the workshop we play a betting game. The betting sheet contains pid (player id), lid (level id), ou (over/under), retry (the quoted retry value), and odd. The question is: would you prefer to play the bet or not? At the end we will compare the profit and loss of our strategy against betting randomly.
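To make the betting sheet concrete, here is a minimal sketch of how a single over/under bet could be settled. The column names follow the sheet above, but the settlement rule itself is an assumption for illustration, not the official dsws scoring logic.

```r
# Illustrative settlement of one over/under bet (assumed rule, not dsws code):
# an "over" bet wins if the actual retry count exceeds the quoted line.
settle_bet <- function(actual_retries, ou, line, odd, stake = 1) {
  won <- (ou == "over" & actual_retries > line) |
         (ou == "under" & actual_retries < line)
  ifelse(won, stake * (odd - 1), -stake)  # profit if won, lose stake otherwise
}

settle_bet(actual_retries = 5, ou = "over", line = 3.5, odd = 2.46)  # 1.46
settle_bet(actual_retries = 2, ou = "over", line = 3.5, odd = 2.46)  # -1
```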
library(dsws)
#install.packages("vroom")
#install.packages("foreach")
#install.packages("tidyverse")
library("vroom")
library("tidyverse")
## Registered S3 methods overwritten by 'readr':
## method from
## as.data.frame.spec_tbl_df vroom
## as_tibble.spec_tbl_df vroom
## format.col_spec vroom
## print.col_spec vroom
## print.collector vroom
## print.date_names vroom
## print.locale vroom
## str.col_spec vroom
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_character() masks vroom::col_character()
## ✖ readr::col_date() masks vroom::col_date()
## ✖ readr::col_datetime() masks vroom::col_datetime()
## ✖ readr::col_double() masks vroom::col_double()
## ✖ readr::col_factor() masks vroom::col_factor()
## ✖ readr::col_guess() masks vroom::col_guess()
## ✖ readr::col_integer() masks vroom::col_integer()
## ✖ readr::col_logical() masks vroom::col_logical()
## ✖ readr::col_number() masks vroom::col_number()
## ✖ readr::col_skip() masks vroom::col_skip()
## ✖ readr::col_time() masks vroom::col_time()
## ✖ readr::cols() masks vroom::cols()
## ✖ readr::date_names_lang() masks vroom::date_names_lang()
## ✖ readr::default_locale() masks vroom::default_locale()
## ✖ dplyr::filter() masks stats::filter()
## ✖ readr::fwf_cols() masks vroom::fwf_cols()
## ✖ readr::fwf_empty() masks vroom::fwf_empty()
## ✖ readr::fwf_positions() masks vroom::fwf_positions()
## ✖ readr::fwf_widths() masks vroom::fwf_widths()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::locale() masks vroom::locale()
## ✖ readr::output_column() masks vroom::output_column()
## ✖ readr::problems() masks vroom::problems()
library("foreach")
##
## Attaching package: 'foreach'
##
## The following objects are masked from 'package:purrr':
##
## accumulate, when
The tidyverse is an opinionated collection of R packages designed for data science; we will use the dplyr and ggplot2 packages. For foreach functionality we need the foreach package.
#install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
#install_cmdstan(dir = "~/.cmdstan/cmdstan-2.30.1", cores = getOption("mc.cores", 6), overwrite = TRUE, version = "2.30.1", quiet = TRUE)
library(cmdstanr)
## This is cmdstanr version 0.5.3
## - CmdStanR documentation and vignettes: mc-stan.org/cmdstanr
## - CmdStan path: /Users/guemues/.cmdstanr/cmdstan-2.29.2
## - CmdStan version: 2.29.2
##
## A newer version of CmdStan is available. See ?install_cmdstan() to install it.
## To disable this check set option or environment variable CMDSTANR_NO_VER_CHECK=TRUE.
We need cmdstanr and CmdStan in order to do Bayesian inference.
In the dataset we have the level id, which is sorted and reflects progress, together with the retry count and the superundo and shuffle usage in the level. Superundo and shuffle help the players and increase the probability of passing the level.
Let's see what playing randomly results in:
set.seed(4)
random_play_vector <- sample(c(0,1), replace=TRUE, size=nrow(odds_table))
dsws::score_the_play_vector(random_play_vector, scores = dsws::scores, odds_table = dsws::odds_table)
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
There are 150 users in the dataset and 100 levels, ranging from level 100 to 200. 50 users and 50 levels are used for testing: for users 100 to 150 we do not have the retry information for levels 150 to 200.
ggplot(train_data %>% mutate(retrial_count= pmin(retrial_count, 10)), aes(x=level_id, y= user_id, fill=retrial_count)) +
geom_tile()
Here we see that the blue dots are getting denser as the levels increase, and some players pass levels without retrying.
Let's check our assumption with aggregated data: is the game getting harder with new levels?
retry_counts_per_level <- train_data %>% group_by(level_id) %>% summarise(retrial_count=sum(retrial_count))
ggplot(data=retry_counts_per_level, aes(x=level_id, y= retrial_count)) +
geom_line() +
geom_smooth(span = 0.2)+
geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Our assumption is that the retry count of user i in level j, \(t_{ij}\), follows a geometric distribution \[ t_{ij} \sim Geometric (p_{ij}) \] where \(p_{ij}\) is the probability that user i passes level j in a single try. If we manage to estimate \(p_{ij}\), we can estimate how many retries user i will need to pass level j.
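Under this assumption the expected retry count is \((1 - p) / p\). A quick sanity check in R, noting that `rgeom()` counts failures (retries) before the first success, which matches our definition:

```r
# Sanity check of the geometric assumption: rgeom() counts the failures
# before the first success, so E[t] = (1 - p) / p.
set.seed(1)
p <- 0.25                          # assumed single-try pass probability
retries <- rgeom(1e5, prob = p)    # simulated retry counts
mean(retries)                      # close to (1 - p) / p = 3
```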
To start with the simplest case, let's assume \(p_{ij}\) depends only on the level and the user. In other words, each player has their own success probability on level j, independent of anything else.
\[ p_{ij} = inv\_logit(intercept + u_i + l_j) \]
\[ u_i \sim normal (0, alpha_u) \] \[ l_j \sim normal (0, alpha_l) \] \[ alpha_u \sim exponential (1) \] \[ alpha_l \sim exponential (1) \]
\(alpha_u\) and \(alpha_l\) are hyperpriors for the multilevel model. We only observe \(t_{ij}\), while the other parameters have latent effects. Check https://nicholasrjenkins.science/tutorials/bayesian-inference-with-stan/mm_stan/ for more on Stan and multilevel modelling.
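In R the inverse-logit link is `plogis()`. A minimal prior-predictive sketch of the model above (the intercept value of -1 is an illustrative choice, not part of the model specification):

```r
# Prior predictive sketch for p_ij = inv_logit(intercept + u_i + l_j).
# plogis() is R's inverse logit; the intercept of -1 is illustrative only.
set.seed(2)
alpha_u <- rexp(1, rate = 1)     # alpha_u ~ exponential(1)
alpha_l <- rexp(1, rate = 1)     # alpha_l ~ exponential(1)
u <- rnorm(150, 0, alpha_u)      # one latent effect per user
l <- rnorm(100, 0, alpha_l)      # one latent effect per level
p <- plogis(-1 + u[1] + l[1])    # pass probability for user 1, level 1
p                                # always lies in (0, 1)
```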
Let's first define the constants used across trainings and simulations. You can change the chains and iter_sampling according to your computer's performance.
ITER_SAMPLING = 150
CHAINS = 4
MIN_TEST_USER_ID = min(dsws::test_data$user_id)
MIN_TEST_LEVEL_ID = min(dsws::test_data$level_id)
SAMPLE_COUNT = ITER_SAMPLING * CHAINS
FIRST_LEVEL = min(dsws::train_data$level_id)
model_1 <- cmdstan_model('./geometric_model_00.stan')
input_list <- list(
N = nrow(dsws::train_data),
user_id = dsws::train_data$user_id,
N_of_user_id = max(dsws::train_data$user_id),
level_id = dsws::train_data$level_id - FIRST_LEVEL + 1,
N_of_level_id = max(dsws::train_data$level_id) - FIRST_LEVEL + 1,
retrial_count = dsws::train_data$retrial_count,
N_of_test_user = length(unique(dsws::test_data$user_id)),
N_of_test_level = length(unique(dsws::test_data$level_id))
)
fit <- model_1$sample(
data = input_list,
iter_warmup = ITER_SAMPLING,
iter_sampling = ITER_SAMPLING,
chains = CHAINS,
parallel_chains = CHAINS,
show_messages=FALSE
)
test_users_simulated_try_model_1 <- get_simulated_retry(
fit$output_files(),
MIN_TEST_USER_ID,
MIN_TEST_LEVEL_ID
)
## Rows: 600 Columns: 2759
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2759): lp__, accept_stat__, stepsize__, treedepth__, n_leapfrog__, dive...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let's check the first odd in the table; the odd is 2.46. Is it logical to play this odd or not?
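The decision rule boils down to comparing the offered odd with the fair (break-even) odd implied by the model: if the bet wins with probability p, the break-even decimal odd is \(1/p\), equivalently \((1 - p)/p + 1\). A small sketch with an illustrative probability (the value 0.45 is made up, not a model output):

```r
# Fair-odds check: play only when the offered odd exceeds the break-even odd
# 1 / p implied by the model's win probability p (p = 0.45 is illustrative).
p <- 0.45
suggested_odd <- (1 - p) / p + 1   # equals 1 / p
offered_odd <- 2.46
play <- offered_odd > suggested_odd
c(suggested_odd = suggested_odd, play = play)
```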
## `summarise()` has grouped output by 'user_id'. You can override using the
## `.groups` argument.
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
The new assumption is again that the retry count of user i in level j, \(t_{ij}\), follows a geometric distribution \[ t_{ij} \sim Geometric (p_{ij}) \] where \(p_{ij}\) is the probability that user i passes level j in a single try, given the shuffle usage \(shuffle_{ij}\) and superundo usage \(superundo_{ij}\). If we manage to estimate \(p_{ij}\), we can estimate how many retries user i will need to pass level j.
Let’s assume \(p_{ij}\) depends on the level, the user, and the perks used during the level. In other words, not only the user and the level but also perks like superundo and shuffle affect \(p_{ij}\).
\[ p_{ij} = inv\_logit(intercept + u_i + l_j + b_{superundo} * superundo_{ij}+ b_{shuffle} *shuffle_{ij}) \]
\[ u_i \sim normal (0, alpha_u) \] \[ l_j \sim normal (0, alpha_l) \] \[ alpha_u \sim exponential (1) \] \[ alpha_l \sim exponential (1) \] \[ b_{superundo} \sim normal (0, 1) \] \[ b_{shuffle} \sim normal (0, 1) \]
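A quick sketch of how a positive perk coefficient shifts the pass probability through the inverse logit (all numeric values here are illustrative, not fitted estimates):

```r
# Effect of a perk term on p_ij (illustrative values, not fitted estimates):
# a positive b_superundo raises the pass probability when superundo is used.
baseline  <- plogis(-1)              # intercept + u_i + l_j = -1, no perks
with_undo <- plogis(-1 + 0.8 * 1)    # assumed b_superundo = 0.8, one superundo
c(baseline = baseline, with_undo = with_undo)  # with_undo > baseline
```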
\(alpha_u\) and \(alpha_l\) are hyperpriors for the multilevel model. We only observe \(t_{ij}\), while the other parameters have latent effects. Check https://nicholasrjenkins.science/tutorials/bayesian-inference-with-stan/mm_stan/ for more on Stan and multilevel modelling.
model_2 <- cmdstan_model('geometric_model_01.stan')
input_list <- list(
N = nrow(train_data),
user_id = train_data$user_id,
N_of_user_id = max(train_data$user_id),
level_id = dsws::train_data$level_id - FIRST_LEVEL + 1,
N_of_level_id = max(dsws::train_data$level_id) - FIRST_LEVEL + 1,
retrial_count = train_data$retrial_count,
N_of_test_user = length(unique(test_data$user_id)),
N_of_test_level = length(unique(test_data$level_id)),
superundo = dsws::superundo,
shuffle = dsws::shuffle
)
fit <- model_2$sample(
data = input_list,
iter_warmup = ITER_SAMPLING,
iter_sampling = ITER_SAMPLING,
chains = CHAINS,
parallel_chains = CHAINS,
show_messages=FALSE
)
test_users_simulated_try_model_2 <- dsws::get_simulated_retry(
fit$output_files(),
MIN_TEST_USER_ID,
MIN_TEST_LEVEL_ID
)
## Rows: 600 Columns: 2761
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2761): lp__, accept_stat__, stepsize__, treedepth__, n_leapfrog__, dive...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
model_2_play_vector <- test_users_simulated_try_model_2 %>%
merge(dsws::odds_table) %>%
group_by(user_id, level_id) %>%
summarise(p=sum(value < line)/n(), odd=median(odd)) %>%
mutate(suggested_odd=(1 - p) / p + 1) %>%
mutate(play=odd>suggested_odd) %>%
pull(play)
## `summarise()` has grouped output by 'user_id'. You can override using the
## `.groups` argument.
dsws::score_the_play_vector(model_2_play_vector, scores = dsws::scores, odds_table = dsws::odds_table)
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
In section 3 we tried to check whether the game becomes harder by looking at the retry counts per level, yet we could not see that this is the case: the blue trend line has a very low slope. Now we have difficulty estimates as model results. Let's check whether this still holds.
levels <- vroom(fit$output_files(), comment = '#', delim = ',') %>%
select(starts_with('l.')) %>%
tidyr::pivot_longer(everything()) %>%
tidyr::separate(name, c("_", "level_id"), sep = "\\.") %>%
mutate(level_id = as.numeric(level_id) + 100) %>%
group_by(level_id)%>%
summarise(m=mean(value))
## Rows: 600 Columns: 2761
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2761): lp__, accept_stat__, stepsize__, treedepth__, n_leapfrog__, dive...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(data=levels, aes(x=level_id, y= m)) +
geom_line() +
geom_smooth(span = 0.1)+
geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Now there is a clear trend. Do you have any idea why?